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Abstract — We consider the estimation of an i.i.d. vector x G K" 
from measurements y G obtained by a general cascade model 
consisting of a Itnown linear transform followed by a probabilistic 
componentwise (possibly nonlinear) measurement channel. A 
novel method, called adaptive generalized approximate message 
passing (Adaptive GAMP), that enables joint learning of the 
statistics of the prior and measurement channel along with 
estimation of the unknown vector x is presented. The proposed 
algorithm is a generalization of a recently-developed technique 
by Vila and Schniter that uses expectation-maximization (EM) 
iterations where the posteriors in the E-steps are computed via 
approximate message passing. The proposed methodology can 
be applied to a large class of learning problems including the 
learning of sparse priors in compressed sensing or identification 
of linear-nonlinear cascade models in dynamical systems and 
neural spiking processes. We prove that for large i.i.d. Gaussian 
transform matrices the asymptotic componentwise behavior of 
the adaptive GAMP algorithm is predicted by a simple set 
of scalar state evolution equations. This analysis shows that 
the adaptive GAMP method can yield asymptotically consistent 
parameter estimates, which implies that the algorithm achieves 
a reconstruction quality equivalent to the oracle algorithm 
that knows the correct parameter values. The adaptive GAMP 
methodology thus provides a systematic, general and computa- 
tionally efficient method applicable to a large range of complex 
linear-nonlinear models with provable guarantees. 



I. Introduction 

Consider tlie estimation of a random vector x G M" from a 
measurement vector y G M™. As illustrated in Figure [T] the 
vector X, which is assumed to have i.i.d. components Xj ^ 
Px, is passed through a known linear transform that outputs 
z — Ax G M™. The components of y G M™ are generated 
by a componentwise transfer function PY\z{yi\zi)- This paper 
addresses the cases where the distributions Px and Py\z have 
some parametric uncertainty that must be learned so as to 
properly estimate x. 

This joint estimation and learning problem with linear 
transforms and componentwise nonlinearities arises in a range 
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of applications, including empirical Bayesian approaches to 
inverse problems in signal processing, linear regression and 
classification |[T], Q, and, more recently, Bayesian com- 
pressed sensing for estimation of sparse vectors x from under- 
determined measurements |j3J-|5|. Also, since the parameters 
in the output transfer function Py\z can model unknown 
nonlinearities, this problem formulation can be applied to the 
identification of linear-nonlinear cascade models of dynamical 
systems, in particular for neural spike responses ||6|-|[8j. 

In this work we propose an extension of the Expectation- 
Maximization Generalized Approximate Message Passing 
(EM-GAMP) algorithm of Vila and Schniter g, |[10). In 
that work, the prior Px is described by a generic L-term 
Gaussian mixture (GM) whose parameters are identified by 
an EM procedure | |TT[ . The "expectation" or E-step is per- 
formed by a generalization fTZl of the approximate message 
passing (AMP) algorithm of ||13|. The AMP algorithm is a 
Gaussian approximation of loopy belief propagation that can 
approximately determine the marginal posterior distributions 
of the components Xj given the observations y and the current 
estimates of the distributions Px and Py\z- This posterior 
calculation is the key requirement for the E-step. The AMP al- 
gorithm and related works are reviewed in detail in Section [HI 
Given the posteriors from the E-step, the "maximization" 
or M-step selects a new set of parameters for Px through 
maximum-likelihood estimation. Simulations in Q, show 
remarkably good performance and computational speed for 
EM-GAMP over a wide class of distributions, particularly in 
the context of compressed sensing. 

The main contribution of this paper is to provide a rig- 
orous theoretical justification of the EM-GAMP method for 
large i.i.d. Gaussian matrices A. Specifically, we introduce 
a generalized version of EM-GAMP that we call adaptive 
GAMP. Our main result. Theorem |2] shows that when the 
parameter estimation in the "maximization" or M-step of the 
EM procedure can be computed exactly, the overall parameters 
are provably asymptotically consistent. That means that the 
estimated parameters converge to the true values and the 
performance of adaptive GAMP is asymptotically the same 
as oracle GAMP algorithm with the correct parameter values. 
Moreover, similar to the analysis of the AMP and GAMP 
algorithms in p2| , |fT4)-fr7|, the componentwise asymptotic 
behavior of adaptive GAMP can be described exactly by 
a simple scalar state evolution equations. Adaptive GAMP 
thus provides a computationally-efficient method for a large 



2 



APPROXIMATE MESSAGE PASSING WITH CONSISTENT PARAMETER ESTIMATION 



Px(a-|A,) 


X e R" 


A 


z g 


PY\z{y\z,K) 


y ew 


Unl<nown 


Unknown 


Available 


Signal Prior 


i.i.d. signal 


Mixing Matrix 


Linear 
Measurennents 


Componentwise 
Output Channel 


Measureme 



Algorithm 1 Adaptive GAMP 



Fig. 1: Measurement model considered in this work. The 
vector X G M" with an i.i.d. prior Px{x\\x) passes through 
the linear transform A e followed by a componentwise 

nonlinear channel Py|2(y|z, A^) to result in y e M™. The 
prior Px and the nonlinear channel Py\z depend on the 
unknown parameters A^: and A^, respectively. We propose 
adaptive GAMP to jointly estimate x and (A^;, \z) given the 
measurements y. 



class of joint estimation and learning problems with a simple, 
exact performance characterization and provable conditions for 
asymptotic consistency for large random A. 

II. Adaptive GAMP 

Approximate message passing (AMP) refers to a class of 
algorithms based on Gaussian approximations of loopy belief 
propagation (LBP) for the estimation of the vectors x and z 
according to the model described in Section |l] These methods 
originate from CDMA multiuser detection problems in | [T4) , 
[ITSj, IT9), and, more recently, have attracted considerable 



interest in compressed sensing 1 13|, 1 16 1, 1 17 1. The Gaussian 
approximations used in AMP are closely related to standard 
expectation propagation techniques pO) , ||2T1, but with addi- 
tional simplifications that exploit the linear coupling between 
the variables x and z. The key benefits of AMP methods are 
their computational performance, large domain of application, 
and, for certain large random A, their exact asymptotic perfor- 
mance characterizations with testable conditions for optimality 
p4|-|[T7|. This paper considers an adaptive version of the so- 
called generalized AMP (GAMP) method of |12 | that extends 
the algorithm in IjOij to arbitrary output distributions Py\z- 

The adaptive version of GAMP, described in Algorithm [T] 
extends the original algorithm of | ,12J to allow for simulta- 
neous estimation of parameters in the distributions Px and 
Py\z along with the estimation of x and z. Suppose that the 
distributions Px and Py\z have some parametric forms 



Px{x\K), PY\z{y\z,X.), 



(1) 



for parameters A^ G A^; and A^ e and for parameter sets 
Ax and A^. Algorithm [T] produces a sequence of estimates x* 
and z* for x and z along with parameter estimates A^ and 
A*, The precise value of these estimates depends on several 
factors in the algorithm including the termination criteria and 
the choice of what we will call estimation functions Gl.{-), 
G*(-) and G'(-), and adaptation functions and Hl{-). 

The choice of the estimation and adaptation functions allows 
for considerable flexibility in the algorithm. For example, it is 
shown in |12| that G^(-), G\{-), and G*(-) can be selected 
such that the GAMP algorithm implements Gaussian approx- 
imations of either max-sum LBP or sum-product LBP that 
approximate the maximum a posteriori (MAP) or minimum 
mean squared error (MMSE) estimates of x given y, respec- 
tively. The adaptation functions can also be selected for a 



Require: Matrix A, estimation functions G^(-), G*(-) and 
G*(-) and adaptation functions and 
Initialize i -s— 0, s^^ and some values for x", t°, 
repeat 

{Output node update} 
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number of different parameter-estimation strategies. For space, 
we present only the estimation functions for the sum-product 
GAMP algorithm from 1 12] along with an EM-type adaptation 
from [9|, [ lOj . Some of the analysis below, however, applies 
more generally. 

As described in fTT\, the sum-product estimation can be 
implemented with the functions 



G^(r,T„A,) E[X\R ^ r,Tr, X^], 



(2a) 



Giip,y,Tp,X,) := E[Z\P=p,Y ^y,Tp,X,l (2b) 
Gl{p,y,Tp,X:,) := — (Gl{p,y,Tp,X^) - p] , (2c) 

where the expectations are with respect to the scalar random 
variables 

i? = X + K, T4-AA(0,T,), X^Px{-\K), (3a) 

z = p + y,, v; -AA(o,Tp), Y ^ Py\z{-\zX)- m 

The paper |fl2) shows that the derivatives of these estimation 
functions for lines |9] and 16 are computed via the variances: 



dr 

dGl{p,y,Tp,X) 



wa.v[X\R = r,Tr,Xx] 



dp 



1 



\ar[Z\P ^ p,Y = y,Tp,X^] 



(4a) 



(4b) 



The estimation functions (|2]) correspond to scalar estimates of 
random variables in additive white Gaussian noise (AWGN). 
A key result of 1 12| is that, when the parameters are set to the 
true values (i.e. (A:r,Az) — {Xx,Xz)), the outputs x* and z* 
can be interpreted as sum products estimates of the conditional 
expectations £'(x|y) and £'(z|y). The algorithm thus reduces 
the vector-valued estimation problem to a computationally 



KAMILOV, GOYAL, AND RANGAN 



3 



simple sequence of scalar AWGN estimation problems along 
with linear transforms. 

When the parameters A^; and are unknown, the EM- 
GAMP algorithm of ||9), pO| combines the estimation func- 
tions ^ with parameter estimates A^ and A* computed via 
the adaptation functions: 

1 " 

Hl{r,Tr) := argmax - y]logp(rj |rr, Aa;) (5a) 

Hl{p,y,Tp) argmax — V'logp(yi|pj,Tp,A^)(5b) 

where the densities p{rj\Tr, Xx) and p{yi\pi,Tp, Xz) are the 
likelihoods of the parameters X^. and A^ in the scalar models 
(|3]l. It is shown in [9J, ^lOJ that with the selection of the 
estimation functions (|2| and adaptation functions (|5]l, the 
adaptive GAMP algorithm can be interpreted as a type of EM 
estimation method. Specifically, the computation of the den- 
sities p{rj\Tr, Xx) and p{yi\pi, Tp,Xz) from the scalar models 
Q correspond to the E-step of the EM iteration, while the 
maximizations in (|5]l correspond to the M-step. 

III. Convergence and Asymptotic Consistency 
WITH Gaussian Transforms 

A. General State Evolution Analysis 

Before proving the asymptotic consistency of the EM- 
GAMP method, we first prove a convergence result on the 
general adaptive GAMP algorithm. 

Assumption 1: Consider the adaptive GAMP algorithm run- 
ning on a sequence of problems indexed by the dimension n, 
satisfying the following assumptions: 

(a) For each n, the matrix A e M™^" has i.i.d. components 
with Aij ^ A/'(0, 1/m) and the dimension m — m{n) is a 
deterministic function of n satisfying lim„_j.oo n/m — (3 
for some /3 > 0. 

(b) The input vectors x and initial condition x*' are determin- 
istic sequences whose components converge empirically 
with bounded moments of order s = 2p — 2 as 

lim(x,x°)^^i^'(X,X°), (6) 

n— J-oo 

to some random vector {X,X°) and p — 2. See 
Appendix |A] for the precise definition of this form of 
convergence. 

(c) The output vectors z and y e M'" are generated by 

z = Ax, and — h(zi, Wi) for all i = 1, . . . , m, (7) 

for some scalar function h{z,w) and vector w £ M™ 
representing an output disturbance. It is assumed that 
the output disturbance vector w is deterministic, but 
empirically converges as 

lim w ^^'^ (8) 

n— f oo 

where s = 2p — 2 is as in Assumption (b) and W is some 
random variable. 

(d) For every t, the adaptation function i?*(r,Tr) can be 
regarded as a functional over r satisfying the following 



weak pseudo-Lipschitz continuity property: Consider any 
sequence of vectors r = r'-"' and sequence of scalars 
Tr = Tr"'\ indexed by n satisfying 

lim r(") ''^^^^ i?*, lim t(")=t*, 

n— ^oo n-^oo 

where i?* and t* are the outputs of the state evolution 
equations defined below. Then, 

lim i/*(r("),r(")) = iJ*(i?*,T*). 

n— f oo 

Similarly, iJ*(y,p,rp) satisfies analogous continuity 
conditions in Tp and (y,p). See Appendix |a] for a 
general definition of weakly pseudo-Lipschitz continuous 
functionals. 

(e) The scalar function G\.{r,Tr,Xx) and its derivative 
G'*'x{r,Tr,Xx) with respect to r are continuous in A^; 
uniformly over r in the following sense: For every e > 0, 
t, T* and A* G A^;, there exists an open neighborhood U 
of (r*, A*) such that for all (r^, A^^) G U and r, 

\Giir,Tr,Xx)-GUr,T:,X:)\<e, 
\G'l{r,Tr,X.)-G'l{r,T:,X*x)\<e. 

In addition, the functions G'^(r, Tr,Xx) and G'l,{r, r^, A^) 
must be Lipschitz continuous in r with a Lipschitz 
constant that can be selected continuously in and 
Xx- The functions Gl{p,y,Tp,Xz), Gl{p,y,Tp,Xz) and 
their derivatives G'l{p,y,Tp, X^) G'l{p,y,Tp, Xz) satisfy 
analogous continuity assumptions with respect to p, y, Tp 
and Xz- 

The assumptions are similar to the basic framework of the 
original analysis of AMP in p7J and the non-adaptive GAMP 
algorithm in p2| in that we are considering the estimation 
of random vectors for large, Gaussian i.i.d. matrices A. The 
input vector x and output disturbance w are modeled as 
deterministic, but whose empirical distributions asymptotically 
appear as i.i.d. The remaining assumptions (d) and (e) are 
somewhat technical, but mild, continuity conditions that can 
be satisfied by a large class of adaptation functionals and 
estimation functions. For example, from the definitions in 
Appendix |A] the continuity assumption (d) will be satisfied 
for any functional given by an empirical average 

1 " 

where, for each t, <j>x{rj, Tr) is pseudo-Lipschitz continuous in 
r of order p and continuous in t,. uniformly over r. A similar 
functional can be used for As we will see in Section 

III-B[ functional formed by maximizing empirical averages 
such as (|5]l will also satisfy the conditions of this assumption. 
Now define the sets of vectors 

0* :={(a;„r*,jf ),j = l,...,n}, (9a) 
91 ■.^{{z,,zl,y,,pl),i^l,...,m}. (9b) 

The first vector set, 6**, represents the components of the 
the "true," but unknown, input vector x, its adaptive GAMP 
estimate x* as well as r*. The second vector, 0*, contains 
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the components the "true," but unknown, output vector z, its 
GAMP estimate z*, as well as p* and the observed input y. 
The sets 6** and 9^ are implicitly functions of the dimension 
n. Our main result. Theorem [T] below, will characterize the 
asymptotic joint distribution of the components of these two 
sets as n — ?► oo. Specifically, we will show that the empirical 
distribution of the components of B^. and 0* will converge to 
a random vectors of the form 



converge empirically with bounded moments of order p = 2 



(10) 



where X is the random variable in the initial condition (|6|. 
i?* and X* are given by 



X 



t+i 



(11) 

(12) 



for some deterministic constants a*, r* and that will 
be defined momentarily. Similarly, (Z, P*) ~ A/'(0, Kp, and 



Y = h{Z,W), Z' = G\{P\Y,tL\ 



(13) 



where W is the random variable in ([8]) and K* and are 
also deterministic constants. 

As shown in |1_2^|, the deterministic constants above can be 
computed iteratively with the following state evolution (SE) 
equations: The SE equations are initialized for i = with 



K!,! =E 



X 



X x"- 



(14) 



where the expectation is over the random variables {X, X'^) 
in Assumption [ijb). Then, for t > 0, we first compute: 

tI = /?r^, K*=/3K^, (15a) 
A* = HI{P\tI), (15b) 

= -E-4l-G*(P*,r,T* A*)l , (15c) 



dp 

(t*)2e [gI{P\Y,tIX) 
d 



a. 



-G^(P,/i(z,l¥),r* AJ 



(15d) 
(15e) 



where the expectations are over the random variables 
W^). Then, we compute the quantities associated with 



via the equations 

rT = 



d 



r*E 



E 



dr 
X 



gI{r\tI.,x^) 

X 



(16a) 
(16b) 

(16c) 



where the expectation is over the random variable {X, 

Theorem 1: Consider the random vectors 6** and 0\ gen- 
erated by the outputs of GAMP under Assumption [I] Let 6*^ 
and 0^ be the random vectors in ([TO]l with the parameters 
determined by the SE equations ( [T4] i, ( [T5] l and ([T6|. Then, 
for any fixed t, almost surely, the components of 0^ and 6\ 



as 



lim 

n^oo 



f/'^^-'ol lin.el^'i^^t (17) 

n— ^oo 



where ol. and 0* are given in ( [TOj i. In addition, for any t, the 
limits 



lim A^ — 
lim r* = T* , 

n— >-oo 



lim = A^, 

n— >-oo 

lim = r^, 



(18a) 
(18b) 



also hold almost surely. 
Proof: See Appendix [B] 

Similar to several other analyses of AMP algorithms such 
as 1 12 1, p4)-p7], the theorem provides a scalar equivalent 
model for the componentwise behavior of the adaptive GAMP 
method. That is, asymptotically the components of the sets 
6** and 9l in ( |9aj i are distributed identically to simple scalar 
random variables. The parameters in these random variables 
can be computed via the SE equations ( [T4] i, ^T5\ and ([T6|, 
which can be evaluated with one or two-dimensional integrals. 
From this scalar equivalent model, one can compute a large 
class of componentwise performance metrics such as mean- 
squared error (MSE) or detection error rates. Thus, the SE 
analysis shows that for, essentially arbitrary estimation and 
adaptation functions, and distributions on the true input and 
disturbance, we can exactly evaluate the asymptotic behavior 
of the adaptive GAMP algorithm. 

B. Asymptotic Consistency with EM-GAMP-Like Adaptation 

The general result. Theorem [T] can be used to prove the 
asymptotic consistency of a variant of the EM-GAMP method 
of Vila and Schniter fH, |[10|. Recall from Section |II] that 
we assume the distributions Px and Py\z are parameterized 
as in ([TJ, and the adaptive GAMP algorithm is run with 
the sum-product estimation functions in (|2]) along with log- 
likelihood maximization adaptation functions in (|5]l. To prove 
the asymptotic consistency of the EM-GAMP method, we 
make the following two assumptions that can be regarded as 
identifiability conditions: 

Assumption 2: The family of distributions, Px{x\\x), G 
Kx satisfies the following conditions: 

(a) The parameter space A^^ is a compact topological space. 

(b) There exists a score function 4>x(,r,Tr, Xx) such that for 
any "true" parameter A* G A^;, the maximization 

A:, = &TgmlixE[(l)x{X + V,Tr,Xx)\Tr,X*x] , (19) 

is well-defined, unique and returns the true value, Xx = 
A*. The expectation in ^T9\ is with respect to X ^ 
Pxi-\Xl) and l/^AA(0,r,). 

(c) For every A^: and r^, the score function 4>x{r,Tr, Xx) is 
pseudo-Lipschitz continuous of order p = 2 in r. In 
addition, it is continuous in , Aa; uniformly over r in 
the following sense: For every e > 0, there exits an open 
neighborhood U of r,., A^;, such that for all (t^, A^:) € U 
and all r, 

\(l>xir,Tr,Xx) - (l)xir,Tr,Xx)\ < C. 
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Assumption 3: Consider a family of conditional distribu- 
tions, PY\z{y\z, Xz), Az e Az generated by the mapping 
Y = h{Z,W, Xz) where W ~ Pw is some random variable 
and h{z,w,Xz) is a scalar function. The conditional distribu- 
tion family satisfies the following assumptions: 

(a) The parameter space Az is a compact topological space. 

(b) There exists a score function 4>z{yTP,Tp,Tzo,Xz) such 
that for any "true" parameter A* G and Tp, Tzq with 
< Tp < Tzo, the maximization 

Xz = argmaxE [(l)z{Y, P, Tp, Tzo, Xz)\Tp, r^o, A*] , (20) 

is well-defined, unique and returns the true value, Xz — 
A*, The expectation in ( |20| i is with respect to Y\Z ^ 

PY\z{y\z,X:) and {Z,P) ^ M{0,Kp{Tp,T,o)) where 
Kp(Tp,Tzo) is the covariance matrix 



Jp,Tz0) 



TzO 



TzO 



TzO 
TzQ 



(21) 



(c) For every Xz, Tp, Tzq and y, the score function 
4'z{y,P,Tp,Tzo, Xz) is pseudo-Lipschitz continuous in p 
of order 2. In addition, it is continuous in Tp, Xz uniformly 
over p and y. 

Assumptions |2] and |3] essentially require that the parameters 
Aa; and A^ can be identified under arbitrary Gaussian mea- 
surements. This assumption is essential since the SE analysis 
shows that the GAMP algorithm produces estimates that are 
AWGN corrupted observations of the true values. Note 
that the natural selection of the score functions are the log 
likelihoods: 

<j)^{r,Tr,Xx) ^ log p{r\Tr,Xx), 
<t>z{y,P,Tp,Xz) = \ogp{y\p,Tp,Xz). 

These selection will correspond to the adaptation functions in 
(|5]l. However, our analysis allows other score functions in case 
when one may not want to perform the maximization of the 
exact log-likelihood for computational reasons. 

Assumption 4: Let Px{x\Xx) and PY\z{y\z,Xz) be fami- 
lies of distributions satisfying Assumptions [2] and [3] respec- 
tively. Consider the adaptive GAMP algorithm. Algorithm [T] 
run on a sequence of problems, indexed by the dimension n 
satisfying the following assumptions: 

(a) The matrix A is generated as in Assumption [TJa). 

(b) The true vector x converges empirically with bounded 
moments of order s = 2p — 2 for p = 2 to a distribution 
Px(-|A*) for some true parameter A* G A^^. Although, 
we do not assume the estimator knows the true parameter 
A* , we assume the mean and variance can be estimated to 
initialize the algorithm. That is, the initial conditions are 
set to x° = fixo IE(X|A*) for all j and t° = t^q := 
var(X|A*). 

(c) The output vectors z and y G M" are generated by z = 
Ax, and yi = h{zi,Wi, XI) for some true parameter A* G 
Az where the vector w G M™ is empirically converges 
with bounded moments s — 2p— 2 to the random variable 
W in Assumption |3] 



(d) The adaptation functions are given by the maximization 
of the score functions 

1 " 

Hl{r,Tr) = argmax- V'(/)j;(rj,Tr,Aj,), (22a) 



AxGA, n 



Hlip,y,Tp) 



= arg max — 0z (yi , Pi , Tp , t^q , A^ ) , (22b) 
A=eA. 

where Tzo = (3t^o- 
(e) The estimation functions are set to the sum-product 
GAMP estimation functions (|2| and are assumed to 
satisfy the continuity conditions in Assumption [TJc)- 
The assumptions states that the adaptive GAMP algorithm is 
run with the estimation for the sum-product GAMP algorithm 
along with adaptation functions based on maximizing a score 
function. This procedures includes the variant of the EM- 
GAMP procedure of f9l, described in Section |Il] The 
somewhat strong condition is that we require that we can 
initialize the algorithm to the true mean and variance. If this is 
not possible, then we would need to modify the adaptation to 
estimate the gain a^. and noise t* at the inputs. The analysis is 
however somewhat more involved and we omit it to simplify 
the proofs. 

Theorem 2: Consider the outputs of the adaptive GAMP 
algorithm under Assumption |4] Then, the parameter estimates 
A^ and A* are asymptotically consistent in that, for all t, they 
almost surely converge to the true parameter values 



lim A,. = A^ = A* 



lim A* = A = A! 



(23a) 
(23b) 



are the outputs of the SE equations. 
'* and 0* in (|9al, almost surely converge 



where A^. and A^ 
Moreover, the sets 
empirically with bounded moments of order p to limits 9"^ and 
6'z that are exactly the same as the limits for the corresponding 
oracle algorithm with parameter estimates set to their true 
values. That is, (A^, A*) = (AJ, A*). 
Proof: See Appendix [C] 

The theorem shows, remarkably, that for a very large 
class of the parameterized distributions, an EM-GAMP-like 
algorithm is able to asymptotically estimate the correct param- 
eters. Moreover, there is asymptotically no performance loss 
between the adaptive GAMP algorithm and a corresponding 
oracle GAMP algorithm that knows the correct parameters. 

In addition, the limiting distributions for the outputs of the 
oracle sum-product GAMP algorithm, and hence the adaptive 
sum-product GAMP algorithm, have a particularly simple 
description: The oracle sum-product GAMP algorithm with 
true parameters, {X\.,X\) = (A*, A*), corresponds to the 
standard non-adaptive sum -product GAMP algorithm of |12|. 
The analysis in p2| shows that the SE equations ( [T4] i, ([TSj and 
( fTSI for this case simplify considerably: Specifically, define the 
Fisher information: 



F(tP,Az) =E 



\0gPY\p{Y\P,Tp,Xz 



(24) 
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where PY\p{y\p,Tp, Xz) is the Hkelihood of function of Y 
given P for the variables in Assumption [3jb). Then, the SE 
equations for sum-product GAMP reduce to 



1 



K 

T*+^ = E [var(X|i?*)] 
ai = 1. 



(25a) 
(25b) 

(25c) 

(25d) 
(25e) 



An important feature of these SE equations, is that, as t ^ oo, 
the values always converge to some fixed point |16|. More- 
over, when the fixed points are unique, the values match the 
predicted optimal mean-squared error performance in certain 
scenarios | [T6| , J22) . Thus, in these situations, the adaptive 
GAMP algorithm is not only asymptotically consistent, but 
potentially mean-squared optimal. 

IV. Numerical Example: Estimation of a 
Gauss-Bernoulli input 

Recent results suggest that there is considerable value in 
learning of priors Px in the context of compressed sensing 
ijSTl, which considers the estimation of sparse vectors x from 
underdetermined measurements (m < n) . It is known that es- 
timators such as LASSO offer certain optimal min-max perfor- 
mance over a large class of sparse distributions |24 |. However, 
for many particular distributions, there is a potentially large 
performance gap between LASSO and MMSE estimator with 
the correct prior This gap was the main motivation for |9|, 1 10 1 
which showed large gains of the EM-GAMP method due to its 
ability to learn the prior. Here, we present a simple simulation 
to illustrate the performance gain of adaptive GAMP and 
its asymptotic consistency. Specifically, Fig. |2] compares the 
performance of adaptive GAMP for estimation of a sparse 
Gauss-Bernoulli signal x e M" from ni noisy measurements 

y = Ax + w, 

where the additive noise w is random with i.i.d. entries 
Wi ~ 7V(0,cr2). rpj^g gjgjj^j jgjjgjjj ^ ^ jjj^g 20% 

nonzero components drawn from the Gaussian distribution 
of variance 5. Adaptive GAMP uses EM iterations to jointly 
recover the unknown signal x and the true parameters = 
{p = 0.2,(7^ = 5). The performance of adaptive GAMP is 
compared to that of LASSO with MSE optimal regularization 
parameter, and oracle GAMP that knows the parameters of 
the prior exactly. For generating the graphs, we performed 
1000 random trials by forming the measurement matrix A 
from i.i.d. zero-mean Gaussian random variables of variance 
1 /m. In Figure |2|a), we keep the variance of the noise fixed 
to (T^ = 0.1 and plot the average MSE of the reconstruction 
against the measurement ratio m/n. In Figure |2]^b), we keep 
the measurement ratio fixed to m/n = 0.75 and plot the aver- 
age MSE of the reconstruction against the noise variance a^. 
For completeness, we also provide the asymptotic MSE values 
computed via SE recursion. The results illustrate that GAMP 
significantly outperforms LASSO over the whole range of 



m/n and cr^. Moreover, the results corroborate the consistency 
of adaptive GAMP which achieves nearly identical quality of 
reconstruction with oracle GAMP. The performance results 
indicate that adaptive GAMP can be an effective method for 
estimation when the parameters of the problem are difficult to 
characterize and must be estimated from data. 

V. Conclusions and Future Work 

We have presented an adaptive GAMP method for the 
estimation of i.i.d. vectors x observed through a known linear 
transforms followed by an arbitrary, componentwise random 
transform. The procedure, which is a generalization of EM- 
GAMP methodology of [9|, |10|, estimates both the vector x 
as well as parameters in the source and componentwise output 
transform. In the case of large i.i.d. Gaussian transforms, it is 
shown that the adaptive GAMP method is provably asymptoti- 
cally consistent in that the parameter estimates converge to the 
true values. This convergence result holds over a large class of 
models with essentially arbitrarily complex parameterizations. 
Moreover, the algorithm is computationally efficient since it 
reduces the vector-valued estimation problem to a sequence of 
scalar estimation problems in Gaussian noise. We believe that 
this method is appUcable to a large class of linear-nonlinear 
models with provable guarantees can have applications in 
a wide range of problems. We have mentioned the use of 
the method for learning sparse priors in compressed sensing. 
Future work will include learning of parameters of output 
functions as well as possible extensions to non-Gaussian 
matrices. 

Appendix A 
Convergence of Empirical Distributions 

Bayati and Montanari's analysis in 1 17] employs certain de- 
terministic models on the vectors and then proves convergence 
properties of related empirical distributions. To apply the same 
analysis here, we need to review some of their definitions. We 
say a function (/) : M'' ^ M'* is pseudo-Lipschitz of order 
p > 1, if there exists an L > such for any x, y G W , 



||0(x)-<^(y)|| <i(l + ||x| 



r')l|x-y|| 



Now suppose that for each n — 1,2, 
vectors 

v(") ={v,(n),j = l,...,£(n)}, 



v^") is a set of 



(26) 



where each element Vi(n) e W and t{n) is the number of 
elements in the set. Thus, v*^"^ can itself be regarded as a 
vector with sl{n) components. We say that v*^"^ empirically 
converges with bounded moments of order p as n — > oo to a 
random vector V on W if: For all pseudo-Lipschitz continuous 
functions, 0, of order p. 



1 

lim — y 



(/.(v,(n))=E(0(V))<oo. 



When the nature of convergence is clear, we may write (with 
some abuse of notation) 



V as n ^ oo, 
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Fig. 2: Reconstruction of a Gauss-Bernoulli signal from noisy measurements. The average reconstruction MSB is plotted 
against (a) measurement ratio m/n and (b) AWGN variance a^. The plots illustrate that adaptive GAMP yields considerable 
improvement over ^i-based LASSO estimator. Moreover, it exactly matches the performance of oracle GAMP that knows the 
prior parameters. 



or 



Finally, let be the set of probability distributions on 
with bounded pth moments, and suppose that : — > A is 
a functional to some topological space A. Given a set v'") 
as in p6l ), write i?(v) for H{P^) where is the empirical 
distribution on the components of v. Also, given a random 
vector V with distribution Pv write H(V) for i/(Pv)- Then, 
we will say that the functional H is weakly pseudo-Lipschitz 
continuous of order p if 

lim v(") ^^^^^ V =^ lim i7(v(")) = H{Y), 

where the limit on the right hand side is in the topology of A. 

Appendix B 
Proof of Theorem[T] 

The proof follows along the adaptation argument of 
[ [25] . We use the tilde superscript on quantities such as 
X*, f*, f*, p*, Tp, s*, and z* to denote values generated via a 
non-adaptive version of the GAMP. The non-adaptive GAMP 
algorithm has the same initial conditions as the adaptive 
algorithm (i.e. x° = x'^jfp = Tp,sr^ = = 0), but with 
and A* replaced by their deterministic limits A^ and A*, 
respectively. That is, we replace lines IT] [H] and [15] with 



Gi{plyi,T'p,X,), S* = Gl{ply^,T'p,X,), 
r<t / t t T* \ 



This non-adaptive algorithm is precisely the standard GAMP 
method analyzed in p2) . The results in that paper show 
that the outputs of the non-adaptive algorithm satisfy all the 
required limits from the SE analysis. That is. 



lim ei'"i'^t. 



where 6** and 9l are the sets generated by the non-adaptive 
GAMP algorithm: 



):j = l,...,n}, 

§1 = {{z,,zj,y„pl) ■.i = l,...,m}. 



The limits ([TtJi are now proven through a continuity argu- 
ment that shows that the adaptive and non-adapative quantities 
must asymptotically agree with one another. Specifically, we 
will start by proving that the following limits holds almost 
surely for all t > 



lim = lim -||x* 

n— )-oo n— J-oo 77, 

lim Ai = lim |T*-f*| 



-x*||^ = 0, 







(27a) 
(27b) 



where || • ||p is usual the p-norm. Moreover, in the course of 
proving ( |27] i, we will also show that the following limits hold 
almost surely 



lim A^ 



1 



lim 

m— >-oo m 

lim A* = lim -||r* -f*||P = 0, 



1 



p*-p*||P = 0, (28a) 
(28b) 
(28c) 



lim A* = lim — ||s* -s*||P = 0, 

m— >-oo m— ^-oo 777, ^ 

lim A* = lim — ||2*-i*||P = 0, 

m—>-oo m—>-oo JJI ^ 

lim A^ = lim \tI -tI\^ 0, 

n—^oo n-^oo 

lim A* - 

n—>-oo 

_ \t 



lim A; 

n— J-oo 



A* 



(28d) 
(28e) 
(28f) 
(28g) 



The proof of the limits ( [27| i and ( [28) is achieved by an 
induction on t. Although we only need to show the above 
limits for p = 2, most of the arguments hold for arbitrary 
p > 2. We thus present the general derivation where possible. 

To begin the induction argument, first note that the non- 
adaptive algorithm has the same initial conditions as the 
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adaptive algorithm. Thus the hmits ( |27) l and ( |28c| i hold for 

t = and t ~ —1, respectively. 

We now proceed by induction. Suppose that t > and the 
limits pT] ) and ( |28c| l hold for some t and t~l, respectively. 
Since A has i.i.d. components with zero mean and variance 
1/m, it follows from the Marcenko-Pastur Theorem |26| that 
that its 2-norm operator norm is bounded. That is, there exists 
a constant Ca such that 



lim \\A\\p<Ca, lim ||A^||p<C^. (29) 



This bound is the only part of the proof that specifically 
requires p — 2. From (|29|, we obtain 



1p*-P*|1p = |1Ax* 



|A(x'-x')+T;(s'-^-s'-^) + (f; 



T„ S 



< ||A(X* - + |r*|||s*-i - s"||, + |f* - r*|||s*-i||, 

a) 

-' II A II net I Ut|||p;t-1 ■^'^-'-'i -t- 



|r^|||s"-s-i||,. 



|r*-r;|||s'-i||, 

i^;-^plP"llp 



(30) 



where (a) is due to the norm inequality ||Ax||p < || A||p||x||p. 
Since p > 1, we have that for any positive numbers a and b 



{a + b)P < 2P{aP + bP). 



Applying the inequality ( (3T] i into ( [30) 1, we obtain 



(31) 



m 

< 



\p'-pX 



i*iip + k;iiis*-i-§*-^i 



(32) 



pseudo-Lipschitz continuous function (j) of order Then 
m ^ — ' 

i=l 
^ m 

m ^ — ' ' ' 

i=l 
m 

-J2Hply^)-K[<PiP\Y)] 

i=\ 
r m 

<- ^ (1 + \pir' + ipir' + \y.r') \p\ - pi 

m 

^ m 

-^0(p^2/,:)-E[0(P*,r)] 



(b) 

<LCA* 



(34) 



In (a) we use the fact that is pseudo-Lipschitz, and in (b) we 
use Holder's inequality jx'^yl = ||x||p||y||q with g = 1) 
and define C as 



C ■-- 



p-i\p/(p-i 



i=l 

< const X 
1 



1+ - P 
m 



t\\v 



-mi 



|y||^ 



(35) 



where the first step is from Jensen's inequality. Since (p*,y) 
satisfy the limits for the non-adaptive algorithm we have; 

lim -WWl = 1™ -E = IE [|PT] < ^ (36a) 

2 = 1 

lim -||y||P = lim - V = E < oo (36b) 



Now, since s* and are the outputs of the non-adaptive 
algorithm they satisfy the limits 



lim 



lim 



,p - Tp < CX). 



1 Y-l.t 



i=l 



s*|P = E[|S'Y] <oo, (33a) 
(33b) 



Now, the induction hypotheses state that A^, A*~^ and A^^ 
0. Applying these induction hypotheses along the bounds 
P3a| i, and the fact that n/m ^ (3 we obtain ( |28a[ ). 

To prove ( 28g[ l, we first prove the empirical convergence 
of (p*,y) to (P*,y). Towards this end, let <f>{p,y) be any 



Also, from the induction hypothesis P8a| i, it follows that the 
adaptive output must satisfy the same limit 

1 1 

lim -||p*E= lim -y\pl\P = E\\P*\P] < oo. (37) 

1=1 

Combining (|34|, ([35), (|36|, (|37|, (|28a]l we conclude that for 
alU > 



lim (p*,y) ^"i^^ iP\Y). 



(38) 



The limit ( (38] l along with ( |27b[ ) and the continuity condition 
on Hl{-) in Assumption [ijd) prove the limit in ( |28g| l. 

The limit p8a| i together with continuity conditions on 
in Assumptions [T] show that ( |28c[ ), ( |28d| i and ( |28e[ ) hold for i. 
For example, to show (|28d[), we consider the limit rn ^ oo 
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of the following expression 
1 1 



m ^ m 



G*(pSy,r;,A*)-G*(pSy,r;,A*)||P 



<- P* 

m 



P*ll^-iA*, 



where at (a) we used the Lipschitz continuity assumption. 
Similar arguments can be used for (|28c| and ( |28e[ ). 

To show ( |28b| l, we proceed exactly as for ( |28a| i. Due to 
the continuity assumptions on H^, this limit in turn shows 
that (|28§ holds almost surely. Then, (|27a]i and pTb] ) follow 
directly from the continuity of Gx in Assumptions [T] together 
\2%h) and ( |28f| l. We have thus shown that if the limits ( pT] ) 



with 



and ( |28| ) hold for some t, they hold for Thus, by induction 
they hold for all t. 

Finally, to show ( [T7] i, let be any pseudo-Lipschitz contin- 
uous function (/)(a;,r, x), and define 



E 



(39) 



which, due to convergence of non-adaptive GAMP, can be 
made arbitrarily small by choosing n large enough. Then, 
consider 



)-E (j){X,R\X'+') 



n 



(a) , 

<ei + L\\r 

L' 

H 

n 



|0( 



n 



(1= 



I) 



^11 



if I 



assumptions in AssumptionJTJe) and the convergence of the 
empirical distributions in [Tff . This completes the proof. 

Appendix C 
Proof of Theorem|2] 

We first consider the SE equations (\A\ to ([T6|, and claim 
that with the estimation functions (|2| and adaptation functions 
(p2|), the parameter estimates equal the true values: 



(41) 



for all t. We prove this by induction. Suppose ( |4T] l is true 
for all times t < s for some iteration number s. Then, the 
estimation functions (|2| for < < s are precisely the estimation 
functions for the sum-product GAMP algorithm with the true 
distributions. As discussed at the end of Section |III-B| the 
solutions of these SE equations satisfy ( |25] l. This implies that 
the random variables {Z, Y, P*) in ( pjj ) reduce to 



(Z,P*)^AA(0,Kp(Tp), 
Therefore, 



Y = h{Z, W, A* 



(42) 



(4) 
if) 
(d) 



argmaxE P*, r* , r^o, A^)] 

AjSA^ 

argmaxE[0,(/i(Z,iy,A:),P*,T*,T,o,A,) | A^] 
A^eA^ 



(43) 



(b) 1 1 

<ei + L{Aiy +L{Aiy 

+ L' (A* )^ ((M^)^ + {ACr-^ + {MlY-^ + (Af^)^ 



where (a) is from the SE equation ( |15b| ); (b) is from the 
definition of the adaptation function (22}; (c) follows from 
( |^| ) and (d) is precisely Assumption Sj^b). This shows that 
A^ = A*. A similar arguments shows A^ = A*. Therefore, we 
have shown that if is valid for all t < s, it is valid for 
t = s. By induction, it is valid for all t. In addition, we have 
also shown that the solutions to the SE equations are identical 
to the SE equations for the oracle sum-product algorithm with 
the correct parameters. 

So, it remains to show the limits ^TT) and ( |18a| i. But, 
these limits are a direct application of the general result. 



(40) 



where L, L' are constants independent of n and 



1 




Ml = 


1 


n 


1 lip ' 


n 


1 


ii*+iir, 


Ml = 


1 


11 


1 lip 


n 



Theorem III-A To apply the general result, first observe that 
Kssumptions |4|a), (b), (c) and (e) immediately imply the 
corresponding items in Assumptions [T] So, we only need to 
verify the continuity condition in Assumption [TJd) for the 
adaptation functions in ( p2| . 

We begin by proving the continuity of H^. Fix t, and let 
(y*^"\p("') be a sequence of vectors and Tp"' be a sequence 
of scalars such that 



In (a) we use the fact that is pseudo-Lipshitz, in (b) we 
use ^p-norm equivalence ||x||i < rt^~^/^'||x||p and Holder's 
inequality Ix^y] ^ ||x||p|jy||q with q = p/{p-l). By applying 
of ( |27al ), ( |28bl i and since, Af*+\ A/*+\ A/^, and Ml converge 
to a finite value we can obtain the first equation of ([TTJl by 
taking n — > 00. The second equation in ([TtJi can be shown in 
a similar way. This proves the limits ( [T7] i. 

Also, the first two limits in ( |18a[ ) are a consequence of 
(|28f|) and P8f|). The second two limits follow from continuity 



lim (y("),p("))^^^^) (y,p*) 



lim ri") = tL 



(44) 



where (F, P*) and are the outputs of the state evolution 
equations. For each n, let 



(45) 



We wish to show that Az — > A*, the true parameter Since 
aI"^ G and is compact, it suffices to show that, any 
limit point of any convergent subsequence is equal to A*. 
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So, suppose that A 
subsequence A 



(n) 



A^ to some limit point A on some 



From aI"^ and the definition (|22| it follows that 



> 



■J2Myl''\p^\4-\K). (46) 



Now, since Tp"-* — > and Az"^ A, we can apply the 
continuity condition in Assumption ^c) to obtain 



(n) 



liminf^^k(yf),p("),r*, 



K) 



> 0. 



(47) 



Also, the limit ( |44| i and the fact that (p^ is psuedo-Lipschitz 
continuous of order p implies that 



(48) 



But, ( |43| l shows that A* is the maxima of the right-hand side, 
so 

E[Uy, P\rl, K)] - E[0,(r, P*, r* , A:)]. 



(49) 



Since, by Assumption |3|b), the maxima is unique, A^ = A*. 
Since this limit point is the same for all convergent subse- 
quences, we see that aI"^ A* over the entire sequence. 
We have thus shown that given limits ( |44] l, the outputs of the 
adaptation function converge as 

Thus, the continuity condition on ) in Assumption [T|d) 
is satisfied. The analogous continuity condition on ) can 
be proven in a similar manner. 

Therefore, all the conditions of Assumption [Tlare satisfied 
and we can apply Theorem [T] Thus, the limits p7| ) and ( |18a[ ) 
are valid almost surely and the proof is complete. 
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